Checking Downloaded Videos

Since all the videos are hosted on YouTube, not all of them could be downloaded (some are geo-restricted, others were removed by their owners), so I must clean the dataset and keep only the information of the successfully downloaded videos.

The first step is to check which videos have been downloaded.


In [4]:
import os
import json

DOWNLOAD_DIR = '/imatge/amontes/work/datasets/ActivityNet/v1.3/videos'

# List the downloaded files and strip the '.mp4' extension to recover the video ids
videos = os.listdir(DOWNLOAD_DIR)
videos_ids = [video.split('.mp4')[0] for video in videos]

Now let's load the original dataset and remove all the videos that have not been downloaded.


In [5]:
with open('../dataset/originals/activity_net.v1-3.min.json', 'r') as f:
    dataset = json.load(f)
print('Number of videos of the original dataset: {} videos.'.format(len(dataset['database'])))

# Keep only the videos that have actually been downloaded
# (iterate over a copy of the keys so entries can be deleted while looping)
downloaded_ids = set(videos_ids)
for key in list(dataset['database'].keys()):
    if key not in downloaded_ids:
        del dataset['database'][key]

print('Number of videos successfully downloaded: {} videos'.format(len(dataset['database'])))


Number of videos of the original dataset: 19994 videos.
Number of videos successfully downloaded: 19792 videos

In [6]:
with open('../dataset/tmp/dataset_downloaded.json', 'w') as f:
    json.dump(dataset, f)

Now, a very important piece of information to extract for each video is its number of frames. This will be very helpful for future computations, so let's run the script at python/tools/:

python get_nb_frames.py ../../dataset/tmp/dataset_downloaded.json ../../dataset/tmp/dataset_downloaded_nb_frames.json
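
For reference, here is a minimal sketch of what such a frame-counting script could look like. This is an assumption, not the actual python/tools/get_nb_frames.py: it presumes the video directory used above and OpenCV 3's cv2.CAP_PROP_FRAME_COUNT property (older OpenCV versions expose it under a different name).

import json
import os
import sys

import cv2

# Assumption: same video directory as used earlier in this notebook
VIDEOS_DIR = '/imatge/amontes/work/datasets/ActivityNet/v1.3/videos'

def count_frames(video_path):
    # Return the frame count reported by OpenCV, or None if the video cannot be read
    capture = cv2.VideoCapture(video_path)
    if not capture.isOpened():
        return None
    num_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    capture.release()
    return num_frames if num_frames > 0 else None

input_json, output_json = sys.argv[1], sys.argv[2]
with open(input_json, 'r') as f:
    dataset = json.load(f)

# Annotate every video entry with its frame count (None if unreadable)
for video_id in dataset['database']:
    video_path = os.path.join(VIDEOS_DIR, video_id + '.mp4')
    dataset['database'][video_id]['num_frames'] = count_frames(video_path)

with open(output_json, 'w') as f:
    json.dump(dataset, f)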

In [10]:
with open('../dataset/tmp/dataset_downloaded_nb_frames.json', 'r') as f:
    dataset = json.load(f)
# Remove the few videos that OpenCV was unable to read
# (iterate over a copy of the keys so entries can be deleted while looping)
for key in list(dataset['database'].keys()):
    if dataset['database'][key]['num_frames'] is None:
        del dataset['database'][key]

print('Number of videos successfully downloaded: {}'.format(len(dataset['database'].keys())))


Number of videos successfully downloaded: 19757

Now that I have the final set of videos to work with, I'll store the videos information and the available labels separately. Because the labels are represented as a tree of activities, I'll keep only the leaf nodes, since these are the labels the videos are tagged with.


In [11]:
taxonomy = dataset['taxonomy']

# A node is a leaf if no other node in the taxonomy has it as parent;
# these leaf activities are the labels the videos are tagged with
parent_ids = set(node['parentId'] for node in taxonomy)
leaf_nodes = [node for node in taxonomy if node['nodeId'] not in parent_ids]

with open('../dataset/labels.txt', 'w') as f:
    # Write down the none activity
    f.write('{}\t{}\n'.format(0, 'none'))
    # Write one 'index<TAB>activity' pair per leaf activity, starting at index 1
    for i, node in enumerate(leaf_nodes):
        f.write('{}\t{}\n'.format(i + 1, node['nodeName']))
        
with open('../dataset/videos.json', 'w') as f:
    json.dump(dataset['database'], f)
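
As a quick sanity check (a minimal sketch using the same paths as above), the stored files can be read back like this:

import json

# Load the label mapping: one 'index<TAB>activity' pair per line
with open('../dataset/labels.txt', 'r') as f:
    labels = dict(line.rstrip('\n').split('\t') for line in f)

# Load the cleaned videos information
with open('../dataset/videos.json', 'r') as f:
    videos_info = json.load(f)

print('{} labels (including none) and {} videos stored.'.format(len(labels), len(videos_info)))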